This report is submitted by Greeshma Jeev Koothuparambil and Olayemi Morrison as part of Laboratory 4 of the Visualization (732A98) course for the 2023 Autumn semester.
Assignment 1
Following are the libraries used for the successful completion of this assignment:
plotly
dplyr
seriation
Here is how we loaded our libraries:
library(plotly)
library(dplyr)
library(seriation)
1. For further analysis, import data to R and keep only the columns with the following numbers: 1,2,5,6,7,9,10,16,17,18,19. Use the first column as labels in further analysis.
#reading the file.
df <- read.table("prices-and-earnings.txt", header = TRUE, sep = "\t")
The loaded dataframe looks like this:
| City | Food.Costs… | Womens.Clothing… | Mens.Clothing… | iPhone.4S.hr. | Clothing.Index | Hours.Worked |
|---|---|---|---|---|---|---|
| Amsterdam | 364 | 690 | 1040 | 44.5 | 110.8 | 1755 |
| Athens | 390 | 630 | 1110 | 86.0 | 112.5 | 1822 |
| Auckland | 497 | 560 | 670 | 51.0 | 79.2 | 1852 |
| Bangkok | 422 | 400 | 600 | 165.0 | 64.2 | 2312 |
| Barcelona | 394 | 580 | 1110 | 52.5 | 109.2 | 1761 |
| Beijing | 463 | 660 | 700 | 184.0 | 87.5 | 1979 |
| Wage.Gross | Wage.Net | Vacation.Days | COL..Excl..rent. | COL..incl..rent. | Pur.Power.Gross | Pur.Power.Net |
|---|---|---|---|---|---|---|
| 78.3 | 69.4 | 24 | 77.0 | 69.0 | 101.6 | 90.1 |
| 41.4 | 40.0 | 23 | 66.1 | 58.1 | 62.6 | 60.5 |
| 59.8 | 63.5 | 20 | 76.7 | 67.7 | 78.0 | 82.9 |
| 14.6 | 17.4 | 7 | 55.3 | 48.1 | 26.5 | 31.4 |
| 59.6 | 58.7 | 29 | 74.7 | 65.6 | 79.7 | 78.6 |
| 17.0 | 18.0 | 9 | 60.3 | 51.8 | 28.3 | 29.9 |
| Pur.Power.Annual | Big.Mac.min. | Bread.kg.in.min. | Rice.kg.in.min. | Goods.and.Services… | Good.and.Services.Index | Food..Index |
|---|---|---|---|---|---|---|
| 75.7 | 16 | 7 | 9 | 3034 | 77.0 | 66.0 |
| 52.1 | 30 | 13 | 26 | 2605 | 66.1 | 70.7 |
| 74.8 | 16 | 17 | 8 | 3019 | 76.7 | 90.0 |
| 33.7 | 36 | 26 | 20 | 2178 | 55.3 | 76.5 |
| 66.8 | 19 | 12 | 6 | 2941 | 74.7 | 71.3 |
| 28.2 | 34 | 28 | 16 | 2375 | 60.3 | 83.9 |
It has 72 observations and 21 variables. We now trim the dataframe to the required columns:
df1 <- df[, c(1,2,5,6,7,9,10,16,17,18,19)]
The modified dataframe looks like this:
| City | Food.Costs… | iPhone.4S.hr. | Clothing.Index | Hours.Worked | Wage.Net |
|---|---|---|---|---|---|
| Amsterdam | 364 | 44.5 | 110.8 | 1755 | 69.4 |
| Athens | 390 | 86.0 | 112.5 | 1822 | 40.0 |
| Auckland | 497 | 51.0 | 79.2 | 1852 | 63.5 |
| Bangkok | 422 | 165.0 | 64.2 | 2312 | 17.4 |
| Barcelona | 394 | 52.5 | 109.2 | 1761 | 58.7 |
| Beijing | 463 | 184.0 | 87.5 | 1979 | 18.0 |
| Vacation.Days | Big.Mac.min. | Bread.kg.in.min. | Rice.kg.in.min. | Goods.and.Services… |
|---|---|---|---|---|
| 24 | 16 | 7 | 9 | 3034 |
| 23 | 30 | 13 | 26 | 2605 |
| 20 | 16 | 17 | 8 | 3019 |
| 7 | 36 | 26 | 20 | 2178 |
| 29 | 19 | 12 | 6 | 2941 |
| 9 | 34 | 28 | 16 | 2375 |
It has 72 observations and 11 variables namely:
City, Food.Costs…, iPhone.4S.hr., Clothing.Index, Hours.Worked, Wage.Net, Vacation.Days, Big.Mac.min., Bread.kg.in.min., Rice.kg.in.min., Goods.and.Services…
The Summary of the table is as follows:
## City Food.Costs... iPhone.4S.hr. Clothing.Index
## Amsterdam: 1 Min. :186.0 Min. : 22.00 Min. : 26.70
## Athens : 1 1st Qu.:315.5 1st Qu.: 44.50 1st Qu.: 57.50
## Auckland : 1 Median :413.0 Median : 80.75 Median : 91.25
## Bangkok : 1 Mean :423.5 Mean :116.41 Mean : 89.01
## Barcelona: 1 3rd Qu.:497.5 3rd Qu.:162.75 3rd Qu.:113.55
## Beijing : 1 Max. :927.0 Max. :435.00 Max. :199.20
## (Other) :66
## Hours.Worked Wage.Net Vacation.Days Big.Mac.min.
## Min. :1558 Min. : 8.10 Min. : 6.00 Min. : 9.00
## 1st Qu.:1785 1st Qu.: 21.98 1st Qu.:15.00 1st Qu.:15.75
## Median :1853 Median : 41.30 Median :22.00 Median :20.00
## Mean :1913 Mean : 48.69 Mean :20.38 Mean :28.38
## 3rd Qu.:1995 3rd Qu.: 71.50 3rd Qu.:25.00 3rd Qu.:36.00
## Max. :2375 Max. :132.40 Max. :30.00 Max. :84.00
##
## Bread.kg.in.min. Rice.kg.in.min. Goods.and.Services...
## Min. : 6.00 Min. : 6.00 Min. :1304
## 1st Qu.:10.00 1st Qu.: 9.00 1st Qu.:2126
## Median :14.00 Median :13.00 Median :2686
## Mean :17.40 Mean :15.71 Mean :2720
## 3rd Qu.:20.25 3rd Qu.:20.25 3rd Qu.:3121
## Max. :70.00 Max. :41.00 Max. :4569
##
From the summary it is seen that the values vary from 6 to 4569, which calls for scaling of the data.
#Use the city names as row labels, then drop the City column and scale
rownames(df1) <- df1[, 1]
df1 <- df1[, -1]
df1scaled <- scale(df1)
The summary of the scaled table is as follows:
## Food.Costs... iPhone.4S.hr. Clothing.Index Hours.Worked
## Min. :-1.79764 Min. :-0.9889 Min. :-1.63900 Min. :-1.8878
## 1st Qu.:-0.81745 1st Qu.:-0.7532 1st Qu.:-0.82878 1st Qu.:-0.6797
## Median :-0.07948 Median :-0.3735 Median : 0.05904 Median :-0.3196
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.56011 3rd Qu.: 0.4854 3rd Qu.: 0.64566 3rd Qu.: 0.4340
## Max. : 3.81100 Max. : 3.3370 Max. : 2.89876 Max. : 2.4555
## Wage.Net Vacation.Days Big.Mac.min. Bread.kg.in.min.
## Min. :-1.3242 Min. :-2.2435 Min. :-1.0483 Min. :-0.9825
## 1st Qu.:-0.8715 1st Qu.:-0.8389 1st Qu.:-0.6831 1st Qu.:-0.6379
## Median :-0.2410 Median : 0.2536 Median :-0.4531 Median :-0.2932
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7443 3rd Qu.: 0.7218 3rd Qu.: 0.4126 3rd Qu.: 0.2453
## Max. : 2.7312 Max. : 1.5022 Max. : 3.0097 Max. : 4.5320
## Rice.kg.in.min. Goods.and.Services...
## Min. :-1.1154 Min. :-1.94869
## 1st Qu.:-0.7707 1st Qu.:-0.81749
## Median :-0.3112 Median :-0.04627
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.5218 3rd Qu.: 0.55254
## Max. : 2.9058 Max. : 2.54582
Now that the data is scaled it is time for further analysis.
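As a quick sanity check on the scaling step, scale() standardizes each column to zero mean and unit variance, i.e. (x - mean(x)) / sd(x). A minimal sketch on a toy vector (hypothetical values, not the course data):

```r
# scale() centers each column and divides by the sample standard deviation
x <- c(6, 24, 16, 7, 9)
manual <- (x - mean(x)) / sd(x)
scaled <- as.numeric(scale(x))   # scale() returns a one-column matrix
all.equal(scaled, manual)        # TRUE
```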
2. Plot a heatmap of the data without doing any reordering. Is it possible to see clusters, outliers?
p <- plot_ly(x=colnames(df1scaled), y=rownames(df1scaled), z=df1scaled, type = "heatmap", colors = colorRamp(c("yellow", "red")))
The heatmap without reordering looks like this:
After plotting the graph, it was not possible to identify any clusters or outliers.
3. Compute distance matrices by a) using Euclidean distance and b) as one minus correlation. For both cases, compute orders that optimize Hamiltonian Path Length and use Hierarchical Clustering (HC) as the optimization algorithm. Plot two respective heatmaps and state which plot seems to be easier to analyse and why. Make a detailed analysis of the plot based on Euclidean distance. Use Euclidean Distance matrix in all coming steps.
#Method 1: using the Euclidean distance method
rowdist<-dist(df1scaled, method = "euclidean")
coldist<-dist(t(df1scaled), method = "euclidean")
#Order using Hierarchical Clustering
order1<-seriate(rowdist, "GW")
order2<-seriate(coldist, "GW")
ord1<-get_order(order1)
ord2<-get_order(order2)
reordmatr<-df1scaled[rev(ord1),ord2]
#Method 2: using one minus correlation (cor() works column-wise, so transpose for rows)
cor_rowdist <- as.dist(1 - cor(t(df1scaled)))   # distances between cities (rows)
cor_coldist <- as.dist(1 - cor(df1scaled))      # distances between variables (columns)
#Order using Hierarchical Clustering
cor_order1<-seriate(cor_rowdist, "GW")
cor_order2<-seriate(cor_coldist, "GW")
cor_ord1<-get_order(cor_order1)
cor_ord2<-get_order(cor_order2)
cor_reordmatr<-df1scaled[rev(cor_ord1),cor_ord2]
# Heatmaps for the Euclidean (top) and one-minus-correlation (bottom) orderings
p1 <- plot_ly(x=colnames(reordmatr), y=rownames(reordmatr),
z=reordmatr, type="heatmap", colors =colorRamp(c("yellow", "red"))) %>%
subplot(plot_ly(x=colnames(cor_reordmatr), y=rownames(cor_reordmatr),
z=cor_reordmatr, type="heatmap", colors =colorRamp(c("yellow", "red"))), nrows = 2
)
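The seriate()/get_order() pair used above can be illustrated on a toy distance matrix; the returned object holds a permutation of the row indices that places similar rows next to each other (a minimal sketch, not the course data):

```r
library(seriation)

# Three points where 1 and 3 are close and 2 is far from both
m <- matrix(c(0, 5, 1,
              5, 0, 6,
              1, 6, 0), ncol = 3)
d <- as.dist(m)
o <- seriate(d, "GW")   # hierarchical clustering with Gruvaeus-Wainer leaf ordering
get_order(o)            # a permutation of 1:3 with 1 and 3 adjacent, 2 at an end
```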
The resulting plot, with the Euclidean ordering at the top and the one-minus-correlation ordering at the bottom, looks as follows:
Upon comparing both heatmaps, it is easier to identify clusters in the plot based on the Euclidean distance than in the plot based on one minus correlation. The Euclidean plot displays a better ordering across the variables, which produces a large cluster from Copenhagen to Berlin. A smaller cluster can be identified from Toronto to Miami, with similar shades across the variables. Some outliers were identified as well: between London and Toronto, Istanbul exhibits abnormal values in Big.Mac.min., iPhone.4S.hr. and Hours.Worked, and Dubai shows abnormal values in Hours.Worked and Clothing.Index. Another outlier can be found between Beijing and Manama, namely Ljubljana, with high values in Rice.kg.in.min.
The clusters are as follows:
The outliers are as follows:
4. Compute a permutation that optimizes Hamiltonian Path Length but uses Traveling Salesman Problem (TSP) as solver. Compare the heatmap given by this reordering with the heatmap produced by the HC solver in the previous step – which one seems to be better? Compare also objective function values such as Hamiltonian Path length and Gradient measure achieved by row permutations of TSP and HC solvers (Hint: use criterion() function)
#The Euclidean distance method already established in rowdist & coldist variables
#Order using TSP Solver
order3<-seriate(rowdist, "TSP")
order4<-seriate(coldist, "TSP")
ord3<-get_order(order3)
ord4<-get_order(order4)
reordmatr3<-df1scaled[rev(ord3),ord4]
# Heatmaps for the TSP (top) and HC/GW (bottom) orderings
p3 <- plot_ly(x=colnames(reordmatr3), y=rownames(reordmatr3),
z=reordmatr3, type="heatmap", colors =colorRamp(c("yellow", "red"))) %>%
subplot(
plot_ly(x = colnames(reordmatr), y = rownames(reordmatr),
z = reordmatr, type = "heatmap", colors = colorRamp(c("yellow", "red"))),
nrows = 2
)
The resulting graph is as follows:
For computing the Hamiltonian Path Length and Gradient measure we use the criterion() function from seriation:
#Compute HPL and GM for HC
hpl_hc <- criterion(rowdist, ord1, method = "Path_length")
gm_hc <- criterion(rowdist, ord1, method = "Gradient_raw")
#Compute HPL and GM for TSP
hpl_tsp <- criterion(rowdist, ord3, method = "Path_length")
gm_tsp <- criterion(rowdist, ord3, method = "Gradient_raw")
The Hamiltonian Path Length and Gradient measure for HC are:
## Path_length
## 127.3152
## Gradient_raw
## 41612
and for TSP are:
## Path_length
## 121.0256
## Gradient_raw
## 37770
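The Path_length criterion reported above is simply the sum of the distances between consecutive rows in the permutation. A minimal sketch on hypothetical toy data (for illustration only):

```r
# Hamiltonian path length under the identity permutation
m   <- matrix(c(0, 0, 1, 1, 5, 5), ncol = 1)
dm  <- as.matrix(dist(m))                # pairwise Euclidean distances
ord <- 1:6
hpl <- sum(dm[cbind(ord[-length(ord)], ord[-1])])
hpl                                      # 0 + 1 + 0 + 4 + 0 = 5
```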
The heatmap produced by the TSP solver gives a better reordering of the rows than the HC solver in the previous step: clusters and outliers are easier to identify, and the plot displays a smoother gradient overall. This is reflected in the Hamiltonian Path Length, where the TSP ordering achieves a lower (better) value (121.03 vs 127.32). Note, however, that the Gradient measure is a merit criterion for which higher values are better, and by that measure the HC ordering actually scores higher (41612 vs 37770).
5. Use Plotly to create parallel coordinate plots from unsorted data and try to permute the variables in the plot manually to achieve a better clustering picture. After you are ready with this, brush clusters by different colors and comment about the properties of the clusters: which variables are important to define these clusters and what values of these variables are specific to each cluster. Can these clusters be interpreted? Find the most prominent outlier and interpret it.
#Creating the parallel coordinate plots from unsorted data:
dims=list()
for( i in 1:ncol(df1scaled)){
dims[[i]]=list( label=colnames(df1scaled)[i],
values=as.formula(paste("~",colnames(df1scaled)[i])))
}
p4 <- as.data.frame(df1scaled) %>%
plot_ly(type = 'parcoords',
dimensions = dims
)
The parallel coordinate plot is as follows:
Upon manual permutation of the variables, we noticed some interesting patterns:
There was mostly a negative correlation between Goods.and.Services… and Rice.kg.in.min., as well as a positive correlation between Bread.kg.in.min. and Clothing.Index. Our final output is shown in the image below:
The cluster we found is as shown below:
Cluster in the plot
Close review revealed that the variables iPhone.4S.hr., Bread.kg.in.min., Rice.kg.in.min. and Big.Mac.min. are important for defining this cluster. A prominent outlier was observed with extreme values on three variables, namely Food.Costs…, Goods.and.Services… and Clothing.Index. This suggests a unique data point that deviates from the typical pattern, especially when compared with the clusters. See the image below:
6. Use the data obtained by using the HC solver and create a radar chart diagram with juxtaposed radars. Identify two smaller clusters in your data (choose yourself which ones) and the most distinct outlier.
#Juxtaposed Radar charts
stars(df1, key.loc = c(15, 2), draw.segments = FALSE, col.stars = rep("yellow", nrow(df1)))
From observations of similar shapes in the Radar chart, two clusters have been identified as follows:
Cluster 1: Bratislava, Frankfurt, Kiev, Helsinki, Manama & Sofia.
Cluster 2: Caracas, Los Angeles, Luxembourg, Milan, Stockholm, Zurich.
The most prominent outlier is Beijing, as it has its own unique shape.
7. Which of the tools you have used in this assignment (heatmaps, parallel coordinates or radar charts) was best in analyzing these data? From which perspective? (e.g. efficiency, simplicity, etc.)
The best tool for analyzing these data was the heatmap, as the graphs produced were simpler, allowing the use of pre-attentive processing to quickly identify clusters and outliers. It was also the most efficient for computing and comparing various reordering techniques.
Assignment 2
Following are the libraries used for the successful completion of this assignment:
ggplot2
gridExtra
dplyr
ggpubr
plotly
Here is how we loaded our libraries:
library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggpubr)
library(plotly)
The file is in CSV format without header and this is how we loaded the file and renamed the variables:
#Loading the adult.csv
df <-read.csv("adult.csv", header = F)
colnames(df) <- c("age", "workclass","fnlwgt","education","educationnum",
"maritalstatus","occupation","relationship","race","sex",
"capitalgain","capitalloss","hoursperweek","nativecountry","Incomelevel")
The loaded dataframe looks like this:
| age | workclass | fnlwgt | education | educationnum |
|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | 13 |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 |
| 38 | Private | 215646 | HS-grad | 9 |
| 53 | Private | 234721 | 11th | 7 |
| 28 | Private | 338409 | Bachelors | 13 |
| 37 | Private | 284582 | Masters | 14 |
| maritalstatus | occupation | relationship | race | sex |
|---|---|---|---|---|
| Never-married | Adm-clerical | Not-in-family | White | Male |
| Married-civ-spouse | Exec-managerial | Husband | White | Male |
| Divorced | Handlers-cleaners | Not-in-family | White | Male |
| Married-civ-spouse | Handlers-cleaners | Husband | Black | Male |
| Married-civ-spouse | Prof-specialty | Wife | Black | Female |
| Married-civ-spouse | Exec-managerial | Wife | White | Female |
| capitalgain | capitalloss | hoursperweek | nativecountry | Incomelevel |
|---|---|---|---|---|
| 2174 | 0 | 40 | United-States | <=50K |
| 0 | 0 | 13 | United-States | <=50K |
| 0 | 0 | 40 | United-States | <=50K |
| 0 | 0 | 40 | United-States | <=50K |
| 0 | 0 | 40 | Cuba | <=50K |
| 0 | 0 | 40 | United-States | <=50K |
It has 32561 observations and 15 variables.
The Summary of the table is as follows:
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education educationnum maritalstatus
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty :4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair :4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial:4066 Other-relative: 981 Black : 3124
## Adm-clerical :3770 Own-child : 5068 Other : 271
## Sales :3650 Unmarried : 3446 White :27816
## Other-service :3295 Wife : 1568
## (Other) :9541
## sex capitalgain capitalloss hoursperweek
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## nativecountry Incomelevel
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## ? : 583
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
1. Use ggplot2 to make a scatter plot of Hours per Week versus age where observations are colored by Income level. Why it is problematic to analyze this plot? Make a trellis plot of the same kind where you condition on Income Level. What new conclusions can you make here?
#First Plot
p1 <-ggplot(df,aes(x=hoursperweek, y=age, color=Incomelevel)) +
geom_point()+
ggtitle("Dependency of Hours Per Week over Age based on Income Level")
p2 <-ggplot(df,aes(x=hoursperweek, y=age, color=Incomelevel)) +
geom_point()+facet_grid(Incomelevel~.)+
ggtitle("Dependency of Hours Per Week over Age based on Income Level")
The Scatter plot looks like this:
The Trellis plot looks like this:
The scatter plot suffers from overplotting. Even though there are only two colour levels, it is strenuous for the human eye to separate them when such a large number of observations is plotted in the same graph. The idea of distinguishing the levels and analyzing each level individually fails in this graph.
Analysis through the trellis plot is easier than through the first plot: we can readily distinguish those who earn more than 50K from those who do not. Although overplotting is present in both panels, the main per-level patterns can now be read. People who earn less than 50K are far more numerous than those who earn more than 50K. Most of the lower earners work around 0 to 60 hours per week, while most of the higher earners work between 35 and 60 hours. Most high-income employees are aged between 25 and 75, whereas people with incomes below 50K span ages from 17 to 80, with only a few working beyond 80. One of the most interesting observations is that the concentration of older people working more than 60 hours tends to decrease.
2. Use ggplot2 to create a density plot of age grouped by the Income level. Create a trellis plot of the same kind where you condition on Marital Status. Analyze these two plots and make conclusions.
#Second Plot
p3 <- ggplot(df, aes(x=age, colour = Incomelevel, fill = Incomelevel)) +
geom_density(alpha =0.5)
p4 <- ggplot(df, aes(x=age, colour = Incomelevel, fill = Incomelevel)) +
geom_density(alpha =0.5) +facet_grid(maritalstatus~.)
plot34 <- subplot(p3,p4, nrows = 2)
The density plot is as follows:
The Trellis plot looks like this:
Density:
From the first density plot we can see that the age distributions of both groups are concentrated toward younger ages. The distribution for workers who earn less than 50K is heavily concentrated below age 25. Those who earn more than 50K are mostly between 37 and 47. From the graph it could be inferred that, for the younger group, the probability of earning more than 50K grows with age. In both groups few people work after 65.
Trellis:
In all the trellis panels except never-married and widowed, the high-earner category exhibits higher density. The never-married low earners show a heavy concentration up to around age 28, strongly skewed toward the younger ages, while the never-married high earners also lean young but cover a wide range from 20 to 60.
For the widowed, both earning categories are on par in density and age. There is a heavy concentration of separated people aged roughly 17 to 65. Among lower earners the divorced show a wide spectrum, while the divorced high earners exhibit a comparatively narrow spectrum peaking between ages 42 and 50. Most earners are married to civilians. An interesting pattern was observed among those married to Armed Forces personnel: for earners above 50K the density is higher, and among those earning less than 50K there is an unusual concentration between ages 65 and 80. It may be that during their day it was common for low earners to marry an army employee, or it could have been a time of war when many people were employed in the army. Since the data is from 1994, people aged 65 to 80 were of marriageable age during the Second World War, which fits this interpretation. For married people whose spouse is absent, the low earners show a wide spectrum with a small peak, while the high earners rise from about age 37 to nearly 47 and then decline.
3. Filter out all observations having Capital loss equal to zero. For the remaining data, use Plotly to create a 3D-scatter plot of Education-num vs Age vs Capital Loss. Why is it difficult to analyze this plot? Create a trellis plot with 6 panels in ggplot2 in which each panel shows a raster-type 2d-density plot of Capital Loss versus Education-num conditioned on values of Age (use cut_number() ). Analyze this plot.
#Third Plot
filterdf <- df%>% filter(capitalloss != 0)
p5 <- plot_ly(filterdf, x=~educationnum, y=~age, z=~capitalloss, type="scatter3d")
p6 <- ggplot(filterdf, aes(x=capitalloss, y=educationnum) ) +
stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +facet_grid(cut_number(age,6)~.)
The plots are as follows:
The 3D scatter plot is heavily overplotted. Since it uses only one colour, there is little that can be distinguished in the graph, and even close inspection is unrewarding due to the lack of distinguishable features.
The trellis plot shows the capital loss encountered by people with different educational qualifications across different ages. Irrespective of age, those with an education number above 8 show a high density of capital loss between 1500 and 2500; these correspond to higher educational qualifications, from high school upward. The density of capital loss among people over 54 appears lower than in the other age groups.
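cut_number(), used for the trellis conditioning above, splits a variable into intervals containing roughly equal numbers of observations (unlike cut(), whose default is equal-width intervals). A small illustration on toy values:

```r
library(ggplot2)

x <- c(1, 2, 2, 3, 10, 11, 12, 50)
counts <- table(cut_number(x, 2))   # two bins, four observations each
counts
```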
4. Make a trellis plot containing 4 panels where each panel should show a scatter plot of Capital Loss versus Education-num conditioned on the values of Age by a) using cut_number() b) using Shingles with 10% overlap. Which advantages and disadvantages you see in using Shingles?
#Fourth Plot
#cut_number() graph
p7 <-ggplot(df, aes(x=capitalloss, y=educationnum)) +
geom_point()+facet_grid(cut_number(age,4)~.)
#Getting shingles
Agerange<-lattice::equal.count(df$age, number=4, overlap=0.1)
L<-matrix(unlist(levels(Agerange)), ncol=2, byrow = T)
L1<-data.frame(Lower=L[,1],Upper=L[,2], Interval=factor(1:nrow(L)))
index=c()
Class=c()
for(i in 1:nrow(L)){
Cl=paste("[", L1$Lower[i], ",", L1$Upper[i], "]", sep="")
ind=which(df$age>=L1$Lower[i] &df$age<=L1$Upper[i])
index=c(index,ind)
Class=c(Class, rep(Cl, length(ind)))
}
shingledf <- df[index,]
shingledf$Agerange<-as.factor(Class)
#Shingles Graph
p8 <-ggplot(shingledf, aes(x=capitalloss, y=educationnum)) +
geom_point()+facet_grid(Agerange~.) +
ggtitle("Dependency of Capital loss over Education Num based on Age")
p9 <- subplot(p7,p8, nrows = 2)
The plots are as follows with cut_number() at the top and shingles at the bottom:
The plot conditioned by cut_number() is more organised and neat, and it is easy to interpret each age interval separately. However, it is difficult to follow trends across interval boundaries with this method.
The plot using shingles seems overplotted at first glance, but on closer inspection the data for neighbouring ages can be interpreted and trends can be followed across panels, since adjacent intervals share observations. The overlap barely affects the first interval, so the first panels of the cut_number() and shingle plots look nearly identical.
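The 10% overlap means that boundary observations belong to two adjacent shingles, so the panels together contain more rows than the original data. A minimal sketch with lattice::equal.count (toy data, not the adult dataset):

```r
library(lattice)

x  <- 1:100
sh <- equal.count(x, number = 4, overlap = 0.1)
# levels(sh) is a list of (lower, upper) interval bounds
n_assigned <- sum(sapply(levels(sh), function(iv) sum(x >= iv[1] & x <= iv[2])))
n_assigned > length(x)   # TRUE: overlapping intervals double-count some values
```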
For the first assignment, the coding and analysis were done by Olayemi; for the second assignment, the coding and analysis were done by Greeshma Jeev. We both went through the outputs and the analysis and made suggestions on the results in order to make this report a success. As templates were already available for most of the coding in this assignment, we had more time to discuss and defend our analysis and findings.
The RMD file was designed together and coded by Greeshma Jeev. Content writing was done by both Olayemi and Greeshma Jeev.
APPENDIX
library(plotly)
library(dplyr)
library(seriation)
#Question 1
# Reading the file into the dataframe
df <- read.table("prices-and-earnings.txt", header = TRUE, sep = "\t")
#The Summary of the table is as follows:
summary(df)
#Extracting the required columns for analysis:
df1 <- df[, c(1,2,5,6,7,9,10,16,17,18,19)]
#The Summary of the updated table is as follows:
summary(df1)
#Using the first column as a label for further analysis
rownames(df1) <- df1[, 1]
df1 <- df1[, -1]
df1scaled <- scale(df1)
#The Summary of the table after scaling is as follows:
summary(df1scaled)
#Question 2
#Plotting a heatmap with no ordering
p <- plot_ly(x=colnames(df1scaled), y=rownames(df1scaled), z=df1scaled, type = "heatmap", colors = colorRamp(c("yellow", "red")))
#Question 3
#Method 1: using the Euclidean distance method
rowdist<-dist(df1scaled, method = "euclidean")
coldist<-dist(t(df1scaled), method = "euclidean")
#Order using Hierarchical Clustering
order1<-seriate(rowdist, "GW")
order2<-seriate(coldist, "GW")
ord1<-get_order(order1)
ord2<-get_order(order2)
reordmatr<-df1scaled[rev(ord1),ord2]
#Method 2: using one minus correlation (cor() works column-wise, so transpose for rows)
cor_rowdist <- as.dist(1 - cor(t(df1scaled)))   # distances between cities (rows)
cor_coldist <- as.dist(1 - cor(df1scaled))      # distances between variables (columns)
#Order using Hierarchical Clustering
cor_order1<-seriate(cor_rowdist, "GW")
cor_order2<-seriate(cor_coldist, "GW")
cor_ord1<-get_order(cor_order1)
cor_ord2<-get_order(cor_order2)
cor_reordmatr<-df1scaled[rev(cor_ord1),cor_ord2]
# Heatmaps for the Euclidean (top) and one-minus-correlation (bottom) orderings
p1 <- plot_ly(x=colnames(reordmatr), y=rownames(reordmatr),
z=reordmatr, type="heatmap", colors =colorRamp(c("yellow", "red"))) %>%
subplot(plot_ly(x=colnames(cor_reordmatr), y=rownames(cor_reordmatr),
z=cor_reordmatr, type="heatmap", colors =colorRamp(c("yellow", "red"))), nrows = 2
)
#Question 4
#The Euclidean distance method already established in rowdist & coldist variables
#Order using TSP Solver
order3<-seriate(rowdist, "TSP")
order4<-seriate(coldist, "TSP")
ord3<-get_order(order3)
ord4<-get_order(order4)
reordmatr3<-df1scaled[rev(ord3),ord4]
# Heatmaps for the TSP (top) and HC/GW (bottom) orderings
p3 <- plot_ly(x=colnames(reordmatr3), y=rownames(reordmatr3),
z=reordmatr3, type="heatmap", colors =colorRamp(c("yellow", "red"))) %>%
subplot(
plot_ly(x = colnames(reordmatr), y = rownames(reordmatr),
z = reordmatr, type = "heatmap", colors = colorRamp(c("yellow", "red"))),
nrows = 2
)
#Compute HPL and GM for HC
hpl_hc <- criterion(rowdist, ord1, method = "Path_length")
gm_hc <- criterion(rowdist, ord1, method = "Gradient_raw")
#Compute HPL and GM for TSP
hpl_tsp <- criterion(rowdist, ord3, method = "Path_length")
gm_tsp <- criterion(rowdist, ord3, method = "Gradient_raw")
#Question 5
#Creating the parallel coordinate plots from unsorted data:
dims=list()
for( i in 1:ncol(df1scaled)){
dims[[i]]=list( label=colnames(df1scaled)[i],
values=as.formula(paste("~",colnames(df1scaled)[i])))
}
p4 <- as.data.frame(df1scaled) %>%
plot_ly(type = 'parcoords',
dimensions = dims
)
#Question 6
#Juxtaposed Radar charts
stars(df1, key.loc = c(15, 2), draw.segments = FALSE, col.stars = rep("yellow", nrow(df1)))
setwd("R")
library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggpubr)
library(plotly)
#Loading the adult.csv
df <-read.csv("adult.csv", header = F)
colnames(df) <- c("age", "workclass","fnlwgt","education","educationnum",
"maritalstatus","occupation","relationship","race","sex",
"capitalgain","capitalloss","hoursperweek","nativecountry","Incomelevel")
#First Plot
p1 <-ggplot(df,aes(x=hoursperweek, y=age, color=Incomelevel)) +
geom_point()+
ggtitle("Dependency of Hours Per Week over Age based on Income Level")
p2 <-ggplot(df,aes(x=hoursperweek, y=age, color=Incomelevel)) +
geom_point()+facet_grid(Incomelevel~.)+
ggtitle("Dependency of Hours Per Week over Age based on Income Level")
#Second Plot
p3 <- ggplot(df, aes(x=age, colour = Incomelevel, fill = Incomelevel)) +
geom_density(alpha =0.5)
p4 <- ggplot(df, aes(x=age, colour = Incomelevel, fill = Incomelevel)) +
geom_density(alpha =0.5) +facet_grid(maritalstatus~.)
plot34 <- subplot(p3,p4, nrows = 2)
#Third Plot
filterdf <- df%>% filter(capitalloss != 0)
p5 <- plot_ly(filterdf, x=~educationnum, y=~age, z=~capitalloss, type="scatter3d")
p6 <- ggplot(filterdf, aes(x=capitalloss, y=educationnum) ) +
stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +facet_grid(cut_number(age,6)~.)
#Fourth Plot
#cut_number() graph
p7 <-ggplot(df, aes(x=capitalloss, y=educationnum)) +
geom_point()+facet_grid(cut_number(age,4)~.)
#Getting shingles
Agerange<-lattice::equal.count(df$age, number=4, overlap=0.1)
L<-matrix(unlist(levels(Agerange)), ncol=2, byrow = T)
L1<-data.frame(Lower=L[,1],Upper=L[,2], Interval=factor(1:nrow(L)))
index=c()
Class=c()
for(i in 1:nrow(L)){
Cl=paste("[", L1$Lower[i], ",", L1$Upper[i], "]", sep="")
ind=which(df$age>=L1$Lower[i] &df$age<=L1$Upper[i])
index=c(index,ind)
Class=c(Class, rep(Cl, length(ind)))
}
shingledf <- df[index,]
shingledf$Agerange<-as.factor(Class)
#Shingles Graph
p8 <-ggplot(shingledf, aes(x=capitalloss, y=educationnum)) +
geom_point()+facet_grid(Agerange~.) +
ggtitle("Dependency of Capital loss over Education Num based on Age")
p9 <- subplot(p7,p8, nrows = 2)